Welcome to tutorial page of INEA-Batch-2 Group B Project.
Group-B will be working on the SDM (Species Distribution Modeling) of some target species such as leopard, or key species in Alaungdaw Kathapa National Park. But in this tutorial we will use camera trap data from the Htamanthi Wildlife Sanctuary.
In this tutorial, we’ll cover:
The course gives a brief overview of the concept of species distribution modelling, and introduces the main modelling steps. Codes and data largely follow the materials from Zurell and Engler (2019) although we will use a different case study.
Species distribution models (SDMs) are a popular tool in quantitative ecology (Franklin 2010; Peterson et al. 2011; Guisan, Thuiller, and Zimmermann 2017) and constitute the most widely used modelling framework in global change impact assessments for projecting potential future range shifts of species (IPBES 2016). There are several reasons that make them so popular: they are comparably easy to use because many software packages (e.g. Thuiller et al. 2009; Phillips, Anderson, and Schapire 2006) and guidelines (e.g. Elith, Leathwick, and Hastie 2008; Elith et al. 2011; Merow, Smith, and Silander Jr 2013; Guisan, Thuiller, and Zimmermann 2017) are available, and they have comparably low data requirements.
So, let’s get started in R. As mentioned previously, data used in this tutorial are leopard presence locations in Htamanthi Wildlife Sanctuary.
First of all, we need to load the required packages. If you are not installed them yet, please install them, first.
library(reshape2) #for re-formatting data; version 1.4.3 used
## Warning: package 'reshape2' was built under R version 4.4.2
library(mgcv) #for gams; version 1.8-24 used
## Loading required package: nlme
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
library(dismo) #for SDMs; version 1.1-4 used
## Warning: package 'dismo' was built under R version 4.4.2
## Loading required package: raster
## Warning: package 'raster' was built under R version 4.4.2
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.4.1
##
## Attaching package: 'raster'
## The following object is masked from 'package:nlme':
##
## getData
library(rJava) #for calling maxent from dismo (need Java installed); version 0.9-10 used
library(randomForest) #for random forest SDMs; version 4.6-14 used
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(maxnet) #maxent with maxnet; version 0.1.2 used
## Warning: package 'maxnet' was built under R version 4.4.2
library(glmnet) #needed for maxnet; version 2.0-16 used
## Warning: package 'glmnet' was built under R version 4.4.2
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(MuMIn) #for model selection; version 1.42.1 used
## Warning: package 'MuMIn' was built under R version 4.4.2
##
## Attaching package: 'MuMIn'
## The following object is masked from 'package:randomForest':
##
## importance
library(PresenceAbsence) #for model evaluation; version 1.1.9 used
library(ecospat) #for model evaluation; version 3.0 used
## Warning: package 'ecospat' was built under R version 4.4.2
require(here)
## Loading required package: here
## Warning: package 'here' was built under R version 4.4.1
## here() starts at C:/git/leopard_distribution
In this step, the actual biodiversity and environmental data are gathered and processed. This concerns all data that are required for model fitting but also data that are used for making transfers. Special attention should be put on any scaling mismatches, meaning cases where the spatial (or temporal) grain or extent doe not match between biodiversity and environmental data or within environmental data. In these cases, we need to make decisions about adequate upscaling and downscaling strategies. Another important issue is the form of absence information available for the biodiversity data. Most SDM applications deal with some form of presence information for the species. These can be direct observations, GPS locations of data loggers, or museum records among others. All SDM algorithms require some form of absence or background information that they use as contrast to the presence data. Yet, absence data are rarely available. In such cases, adequate background data or pseudo-absence data needs to be selected. Again, the best strategy will depend on the research question, the data and the SDM algorithm (Guisan, Thuiller, and Zimmermann 2017). Finally, for later model assessment we may wish to partition the data into training data and validation data (Hastie, Tibshirani, and Friedman 2009).
lp.data <- read.csv(here("data", "lp_location_raw_v04.csv"))
lp.val <- read.csv(here("data", "lp_location_valid_v04.csv"))
#subset to presence-only / absence-only
lp.pres <- lp.data[lp.data$IDP_event>=1,]
lp.abs <- lp.data[lp.data$IDP_event==0,]
lp.pres.xy <- as.matrix(lp.pres[,c("longitude", "latitude")])
lp.abs.xy <- as.matrix(lp.abs[,c("longitude", "latitude")])
#validation data
lp.val.pres <- as.matrix(lp.val[lp.val$IDP_event>=1, c("longitude", "latitude")])
lp.val.abs <- as.matrix(lp.val[lp.val$IDP_event==0, c("longitude", "latitude")])
lp.val.xy <- as.matrix(lp.val[,c("longitude", "latitude")])
Here, we will use elevation, NDVI, slope, and other distance related covarites to be included in our modeling process.
#covariate maps
elev <- raster(here("data", "newtif","elev.tif")) #elevation layer
ndvi <- raster(here(
"data","newtif","ndvi.tif")) #Normalised Different Vegetation Index
slope <- raster(here(
"data","newtif","slp.tif")) #Degree of Slope
ndvi <- raster(here(
"data","newtif","ndvi.tif")) #Land Surface Temperature
dist_rd <- raster(here("data","newtif","distance_rasters", "d_rd.tif")) # Distance to Road
dist_water <- raster(here("data","newtif","distance_rasters","d_water.tif")) # Distance to Water
dist_patrol <- raster(here("data","newtif","distance_rasters", "d_patrol.tif")) # Distance to Patrol Stations
dist_vlg <- raster(here("data","newtif","distance_rasters","d_vlg.tif")) # Distance to Villages
canopy_h <- raster(here("data","newtif", "canopy_h_bdy.tif"))
bio1 <- raster(here("data", "newtif", "bio1.tif"))
bio2 <- raster(here("data", "newtif", "bio1.tif"))
proj4string(elev)
## [1] "+proj=longlat +datum=WGS84 +no_defs"
proj4string(ndvi)
## [1] "+proj=longlat +datum=WGS84 +no_defs"
proj4string(slope)
## [1] "+proj=utm +zone=47 +datum=WGS84 +units=m +no_defs"
proj4string(bio1)
## [1] "+proj=longlat +datum=WGS84 +no_defs"
proj4string(bio2)
## [1] "+proj=longlat +datum=WGS84 +no_defs"
proj4string(canopy_h)
## [1] "+proj=longlat +datum=WGS84 +no_defs"
proj4string(dist_rd)
## [1] "+proj=utm +zone=46 +datum=WGS84 +units=m +no_defs"
proj4string(dist_patrol)
## [1] "+proj=utm +zone=46 +datum=WGS84 +units=m +no_defs"
proj4string(dist_vlg)
## [1] "+proj=utm +zone=46 +datum=WGS84 +units=m +no_defs"
proj4string(dist_water)
## [1] "+proj=utm +zone=46 +datum=WGS84 +units=m +no_defs"
ndvi = projectRaster(ndvi, crs = elev)
slope= projectRaster(slope, crs=elev)
bio1= projectRaster(bio1, crs=elev)
bio2= projectRaster(bio2, crs=elev)
canopy_h= projectRaster(canopy_h, crs=elev)
dist_rd= projectRaster(dist_rd, crs=elev)
dist_patrol= projectRaster(dist_patrol, crs=elev)
dist_vlg= projectRaster(dist_vlg, crs=elev)
dist_water= projectRaster(dist_water, crs=elev)
elev <- resample(x=elev, y= dist_water, "bilinear")
ndvi <- resample(x=ndvi, y= dist_water, "bilinear")
slope <- resample(slope, dist_water, "bilinear")
bio1<- resample(bio1,dist_water, "bilinear")
bio2<- resample(bio2,dist_water, "bilinear")
canopy_h <- resample(canopy_h, dist_water, "bilinear")
dist_rd <- resample(dist_rd, dist_water, "bilinear")
dist_patrol <- resample(dist_patrol, dist_water, "bilinear")
dist_vlg <- resample(dist_vlg, dist_water, "bilinear")
compareRaster(elev,ndvi,slope, canopy_h, bio1, bio2, dist_patrol,dist_rd,dist_vlg,dist_water)
## [1] TRUE
layers <- stack(elev,ndvi,slope,canopy_h, bio1, bio2, dist_patrol,dist_rd,dist_vlg,dist_water)
names(layers) <- c("elev","ndvi","slope","canopy_h", "bio1", "bio2", "dist_patrol","dist_rd","dist_vlg","dist_water")
plot(layers)
pairs(layers)
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
## Warning in graphics::par(usr): argument 1 does not name a graphical parameter
back.xy <- randomPoints(layers, p=lp.pres.xy, n=100)
#inspect
head(back.xy)
## x y
## [1,] 95.67196 25.51195
## [2,] 95.73156 25.60138
## [3,] 95.29589 25.28594
## [4,] 95.60789 25.49732
## [5,] 95.93122 25.58106
## [6,] 95.52714 25.47591
#re-name columns
colnames(back.xy) <- c("longitude","latitude")
#plot
plot(elev)
points(back.xy)
pres.cov <- extract(layers, lp.pres.xy) #extracts values from layers at pres locations
back.cov <- extract(layers, back.xy) #extracts values from layers at random locations
val.cov <- extract(layers, lp.val.xy) #extracts values from layers at validation locations
pres.cov <- data.frame(lp.pres.xy, pres.cov, pres=1)
back.cov <- data.frame(back.xy, back.cov, pres=0)
val.cov <- data.frame(lp.val, val.cov)
#remove any potential NAs
pres.cov <- pres.cov[complete.cases(pres.cov),]
back.cov <- back.cov[complete.cases(back.cov),]
val.cov <- val.cov[complete.cases(val.cov),]
#bind presence and background points together
all.cov <- rbind(pres.cov, back.cov)
#inspect
head(all.cov)
## longitude latitude elev ndvi slope canopy_h bio1 bio2
## 1 95.382 25.264 208.2093 0.7561862 12.484785 27.51286 23.69296 23.69296
## 2 95.352 25.239 250.4088 0.7250182 2.620323 30.16228 23.69376 23.69376
## 3 95.365 25.222 256.8364 0.7090580 6.091401 32.61270 23.63927 23.63927
## 4 95.443 25.327 258.9281 0.7043266 7.160226 29.56064 23.63433 23.63433
## 5 95.482 25.296 244.3671 0.7217796 5.549306 31.09164 23.59514 23.59514
## 6 95.500 25.290 293.5323 0.7150859 3.888586 30.49536 23.49664 23.49664
## dist_patrol dist_rd dist_vlg dist_water pres
## 1 9998.600 5788.88080 10281.362 125.7307 1
## 2 7312.740 8929.79349 10016.485 1692.7454 1
## 3 8490.490 10314.47125 12324.110 1825.1379 1
## 4 6072.756 16.84628 9599.184 1086.8957 1
## 5 5024.195 71.71445 14773.551 1582.0540 1
## 6 4007.160 37.20399 16640.143 1946.0672 1
#Step:2: Models {.tabset}
Model fitting is the heart of any SDM application. Many different algorithms are available (Elith et al. 2006), and often several algorithms are combined into ensemble models or several candidate models with different candidate predictor sets are averaged (Hastie, Tibshirani, and Friedman 2009). The decisions on these matters should have been made during the conceptualisation phase. Important aspects to consider during the model fitting step are: how to deal with multicollinearity in the environmental data? How many variables should be included in the model (without overfitting) and how should we select these? Which model settings should be used? When multiple model algorithms or candidate models are fitted, how to select the final model or average the models? Do we need to test or correct for non-independence in the data (spatial or temporal autocorrelation, nested data)? If the goal is to derive binary predictions, which threshold should be used? More detailed descriptions on these aspects can be found in Franklin (2010) and in Guisan, Thuiller, and Zimmermann (2017).
Exploration of model behaviour is strictly part of the model assessment step, e.g. checking the plausibility of the fitted species-environment relationship by visual inspection of response curves, and by assessing model coefficients and variable importance. However, for simplicity, we simultaneously look at model fitting and visualise model behaviour here.
Let’s start with fitting a simple GLM. We can fit linear, quadratic or higher polynomial terms (check poly()) and interactions between predictors: - the term I()indicates that a variable should be transformed before being used as predictor in the formula - poly(x,n) creates a polynomial of degree n : x+x2+…+xn
glm.lp <- glm(pres~ elev+ndvi
+slope+canopy_h+dist_patrol+dist_rd+dist_vlg+dist_water, family=binomial(link=logit), data=all.cov)
#inspect
summary(glm.lp)
##
## Call:
## glm(formula = pres ~ elev + ndvi + slope + canopy_h + dist_patrol +
## dist_rd + dist_vlg + dist_water, family = binomial(link = logit),
## data = all.cov)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.718e+00 6.196e+00 -1.084 0.278282
## elev 2.252e-02 5.882e-03 3.829 0.000129 ***
## ndvi 4.129e+00 7.942e+00 0.520 0.603135
## slope -7.353e-02 6.886e-02 -1.068 0.285638
## canopy_h 1.907e-02 5.297e-02 0.360 0.718876
## dist_patrol -1.851e-04 4.465e-05 -4.145 3.39e-05 ***
## dist_rd -2.761e-04 6.051e-05 -4.562 5.07e-06 ***
## dist_vlg 7.763e-05 5.973e-05 1.300 0.193691
## dist_water -3.069e-04 3.034e-04 -1.012 0.311743
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 221.26 on 164 degrees of freedom
## Residual deviance: 137.15 on 156 degrees of freedom
## AIC: 155.15
##
## Number of Fisher Scoring iterations: 5
#mapping
glm.map <- predict(layers, glm.lp, type="response")
#plot
plot(glm.map, axes=F, box=F, main="GLM")
points(lp.pres.xy)
plot(glm.lp) # check the model fitness
Random forests use a bagging procedure for averaging the outputs of many different CARTs (classification and regression trees)(Liaw and Wiener 2002). Bagging stands for “bootstrap aggregation”. Basically, we fit many CARTs to bootstrapped samples of the training data and then either average the results in case of regression trees or make a simple vote in case of classification trees (committee averaging)(Hastie, Tibshirani, and Friedman 2009; Guisan, Thuiller, and Zimmermann 2017). An important feature of random forests are the out-of-bag samples, which means that the prediction/fit for a specific data point is only derived from averaging trees that did not include this data point during tree growing. Thus, the output of Random Forests is essentially cross-validated. Random forests estimate variable importance by a permutation procedure, which measures for each variable the drop in mean accuracy when this variable is permutated.
#random forest model (default)
rf.lp <- randomForest(as.factor(pres) ~ elev+ndvi
+slope+canopy_h+dist_patrol+dist_rd+dist_vlg+dist_water, na.action=na.omit, data=all.cov)
#tuning model
rf.lp.tune <- tuneRF(y=as.factor(all.cov$pres), x = all.cov[,c(3:6)], stepFactor=0.5, ntreeTry=500)
## mtry = 2 OOB error = 32.12%
## Searching left ...
## mtry = 4 OOB error = 33.94%
## -0.05660377 0.05
## Searching right ...
## mtry = 1 OOB error = 33.94%
## -0.05660377 0.05
#update rf model with mtry=1 based on tuning
rf.lp <- randomForest(as.factor(pres) ~ elev+ndvi
+slope+canopy_h+dist_patrol+dist_rd+dist_vlg+dist_water, mtry=1, ntree=500, na.action=na.omit, data=all.cov)
#variable importance plot
varImpPlot(rf.lp)
#mapping
rf.map <- predict(layers, rf.lp, type="prob",index=2)
#plot
plot(rf.map, axes=F, box=F, main="RF")
points(lp.pres.xy)
##2.3 Maxent
MAXENT is now a common species distribution modeling (SDM) tool used by conservation practitioners for predicting the distribution of a species from a set of records and environmental predictors.
for Maxent to run, place the maxent.jar file in the following directory:
system.file("java",package="dismo")
## [1] "C:/Users/minhe/AppData/Local/R/win-library/4.4/dismo/java"
max.lp <- maxent(layers, p=lp.pres.xy)
summary(max.lp)
## Length Class Mode
## 1 MaxEnt S4
max.lp <- maxent(layers, p=lp.pres.xy, a=back.xy)
maxent.beta.3 <- maxent(layers, p=lp.pres.xy, a=back.xy,
args=c("betamultiplier=0.3"))
maxent.beta3 <- maxent(layers, p=lp.pres.xy, a=back.xy,
args=c("betamultiplier=3"))
maxent.features <- maxent(layers, p=lp.pres.xy, a=back.xy,
args=c("noproduct", "nohinge","nothreshold","noautofeature"))
eval.max <- evaluate(p=lp.val.pres, a=lp.val.abs, max.lp, layers)
eval.max3 <- evaluate(p=lp.val.pres, a=lp.val.abs, maxent.beta3, layers)
eval.maxfeatures <- evaluate(p=lp.val.pres, a=lp.val.abs, maxent.features, layers)
eval.max
## class : ModelEvaluation
## n presences : 12
## n absences : 16
## AUC : 1
## cor : 0.9652173
## max TPR+TNR at : 0.536247
eval.max3
## class : ModelEvaluation
## n presences : 12
## n absences : 16
## AUC : 1
## cor : 0.9691336
## max TPR+TNR at : 0.6256913
eval.maxfeatures
## class : ModelEvaluation
## n presences : 12
## n absences : 16
## AUC : 1
## cor : 0.9824327
## max TPR+TNR at : 0.6341303
response(max.lp, expand=0)
response(maxent.beta.3, expand=0)
response(maxent.beta3, expand=0)
response(maxent.features, expand=0)
max.map <- predict(layers, max.lp)
### plot
plot(max.map, axes=F, box=F, main="Maxent")
max.raw.map <- predict(layers, max.lp, args="outputformat=raw")
### plot
plot(max.raw.map, axes=F, box=F, main="Maxent-raw")
cellStats(max.raw.map, mean)
## [1] 0.002532264
#Step:3: K-fold validation
In the model assessment step, we analyse the fitted model in depth. Strictly, checking the plausibility of the fitted species-environment relationship by visual inspection of response curves, and by assessing model coefficients and variable importance would also be part of the model assessment. However, to better understand what the different model algorithms were doing, we already explored this step during model fitting. Another crucial aspect of model assessment, which we will look at in more detail here, is assessing the predictive performance for a set of validation or test data (Hastie, Tibshirani, and Friedman 2009).
Next, we assess k-fold validation model performance. We inspect different measures: AUC, the area under the receiver operating characteristic (ROC) curve (Hosmer and Lemeshow 2013); TSS, the true skill statistic (Allouche, Tsoar, and Kadmon 2006); sensitivity, the true positive rate; and specificity, the true negative rate. Simultaneously, we estimate the optimal threshold for making binary predictions. For this, we use a threshold that maximises TSS (= maximises the sum of sensitivity and specificity) (Liu et al. 2005).
require(glmnet)
summary.eval.kfold <- data.frame(matrix(nrow=0, ncol=11))
names(summary.eval.kfold) <- c("model", "k", "auc", "corr", "ll", "boyce",
"threshold", "sens", "spec", "tss", "kappa")
folds <- 5 #number of k-folds considered
kfold_pres <- kfold(pres.cov, k=folds)
kfold_back <- kfold(back.cov, k=folds)
for(k in 1:folds){
#partition data into folds
kfold <- k
val.pres.k <- pres.cov[kfold_pres == kfold, ]
val.back.k <- back.cov[kfold_back == kfold, ]
val.k <- rbind(val.pres.k, val.back.k)
val.k.cov <- val.k[,cbind("elev","ndvi","slope","canopy_h", "bio1", "bio2", "dist_patrol","dist_rd","dist_vlg","dist_water")]
train.pres.k <- pres.cov[kfold_pres != kfold, ]
train.back.k <- back.cov[kfold_back != kfold, ]
train.k <- rbind(train.pres.k, train.back.k)
#models fit to fold
glm.k <- glm(pres~elev+ndvi
+slope+canopy_h+dist_patrol+dist_rd+dist_vlg+dist_water,
family=binomial(link=logit), data=train.k)
rf.k <- randomForest(as.factor(pres) ~ elev+ndvi
+slope+canopy_h+dist_patrol+dist_rd+dist_vlg+dist_water, data=train.k, importance=F, ntree=500, mtry=1, na.action=na.omit)
max.k <- maxent(layers, p=train.pres.k[,1:2], a=train.back.k[,1:2])
#predictions for evaluation
glm.val <- predict(glm.k,val.k.cov, type="response")
rf.val <- predict(rf.k,val.k.cov, type="prob")
rf.val <- rf.val[,2]
max.val <- predict(max.k, val.k.cov)
#evaluate model on fold
val.data <- data.frame(siteID=1:nrow(val.k), obs=val.k$pres,
glm=glm.val, rf=rf.val, max=max.val)
for(i in 1:nmodels){
#calculate metrics for fold
auc.i <- auc(val.data, which.model=i)
kappa.opt <- optimal.thresholds(val.data, which.model=i, opt.methods=3)
sens.i <- sensitivity(cmx(val.data, which.model=i,threshold = kappa.opt[[2]]))
spec.i <- specificity(cmx(val.data, which.model=i,threshold = kappa.opt[[2]]))
tss.i<- sens.i$sensitivity +spec.i$specificity - 1
kappa.i <- Kappa(cmx(val.data, which.model=i,threshold = kappa.opt[[2]]))
corr.i<-cor.test(val.data[,2],val.data[,i+2])$estimate
ll.i <- sum(log(val.data[,i+2]*val.data[,2] + (1-val.data[,i+2])*(1-val.data[,2])))
ll.i <- ifelse(ll.i=="-Inf", sum(log(val.data[,i+2]+0.001)*val.data[,2] + log((1-val.data[,i+2]))*(1-val.data[,2])),ll.i)
boyce.i <- ecospat.boyce(fit=val.data[,i+2],obs=val.data[1:nrow(val.pres.k),i+2],res=100,PEplot = F)
#summarize
summary.i <- c(i,k,auc.i$AUC,corr.i,ll.i, boyce.i$Spearman.cor, kappa.opt[[2]],sens.i$sensitivity,spec.i$specificity,tss.i,kappa.i[[1]])
summary.eval.kfold <- rbind(summary.eval.kfold, summary.i)
}
print(k)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
names(summary.eval.kfold) <- c("model", "k", "auc", "corr", "ll", "boyce",
"threshold", "sens", "spec", "tss")
require(plyr)
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 4.4.1
##
## Attaching package: 'plyr'
## The following object is masked from 'package:here':
##
## here
summary.eval.kfold
## model k auc corr ll boyce threshold sens spec
## 1 1 1 0.9076923 0.7484967 -13.29936 0.510 1.0000000 0.80 0.8000000
## 2 2 1 1.0000000 0.8858886 -10.20160 0.555 1.0000000 1.00 1.0000000
## 3 3 1 0.9461538 0.7784785 -11.82644 0.540 1.0000000 0.85 0.8500000
## 4 1 2 0.8500000 0.5966658 -16.02960 0.675 0.6923077 0.90 0.5923077
## 5 2 2 0.9192308 0.7542439 -12.34299 0.490 0.9230769 0.95 0.8730769
## 6 3 2 0.9115385 0.7140280 -13.56033 0.490 0.7692308 0.95 0.7192308
## 7 1 3 0.7384615 0.3772249 -22.45122 0.575 0.6153846 0.80 0.4153846
## 8 2 3 0.8480769 0.6390443 -15.23593 0.305 0.8461538 0.75 0.5961538
## 9 3 3 0.8230769 0.5768042 -15.84010 0.490 0.6153846 0.90 0.5153846
## 10 1 4 0.7923077 0.5215505 -20.03856 0.260 0.5384615 0.90 0.4384615
## 11 2 4 0.8692308 0.6540369 -15.12457 0.380 0.6923077 1.00 0.6923077
## 12 3 4 0.8538462 0.6570615 -15.19764 0.275 0.6923077 0.95 0.6423077
## 13 1 5 0.8692308 0.6621747 -14.17743 0.775 0.7692308 1.00 0.7692308
## 14 2 5 0.9673077 0.8334597 -12.14316 0.480 0.8461538 1.00 0.8461538
## 15 3 5 0.9346154 0.7773976 -12.32909 0.520 0.8461538 0.95 0.7961538
## tss
## 1 0.7591241
## 2 1.0000000
## 3 0.8170055
## 4 0.6086957
## 5 0.8730769
## 6 0.7391304
## 7 0.4210526
## 8 0.5730129
## 9 0.5370741
## 10 0.4634146
## 11 0.7317073
## 12 0.6693387
## 13 0.8016032
## 14 0.8695652
## 15 0.8070175
#average across folds
(summary.eval.kfold.round <- round(ddply(summary.eval.kfold, .(model), summarise,
auc = mean(auc),
cor = mean(corr),
ll = mean(ll),
boyce = mean(boyce),
threshold = mean(threshold),
tss = mean(tss),
kappa = mean(kappa)
),3))
## Warning in mean.default(kappa): argument is not numeric or logical: returning
## NA
## Warning in mean.default(kappa): argument is not numeric or logical: returning
## NA
## Warning in mean.default(kappa): argument is not numeric or logical: returning
## NA
## model auc cor ll boyce threshold tss kappa
## 1 1 0.832 0.581 -17.199 0.559 0.723 0.611 NA
## 2 2 0.921 0.753 -13.010 0.442 0.862 0.809 NA
## 3 3 0.894 0.701 -13.751 0.463 0.785 0.714 NA
summary.eval.kfold.round$model <- c("glm", "rf", "max")
#Step:4: Ensembles
We can also combine predictions from the two SDM algorithms and make an ensemble prediction, for example by taking the median.
models <- stack(glm.map, rf.map, max.map)
names(models) <- c("glm","rf", "max")
AUC.rf <- summary.eval.kfold.round[summary.eval.kfold.round$model=="rf", "auc"]
AUC.glm <- summary.eval.kfold.round[summary.eval.kfold.round$model=="glm", "auc"]
AUC.max <- summary.eval.kfold.round[summary.eval.kfold.round$model=="max", "auc"]
auc.weight <- c(AUC.glm, AUC.rf, AUC.max)
#AUC-based ensemble
ensemble.auc <- weighted.mean(models, auc.weight)
#plot
plot(ensemble.auc)
#Get thresholds identified in PresenceAbsence
thres.rf <- summary.eval.kfold.round[summary.eval.kfold.round$model=="rf", "threshold"]
thres.glm <- summary.eval.kfold.round[summary.eval.kfold.round$model=="glm", "threshold"]
thres.max <- summary.eval.kfold.round[summary.eval.kfold.round$model=="max", "threshold"]
#Create binary maps
rf.thres.map <- rf.map
glm.thres.map <- glm.map
max.thres.map <- max.map
values(rf.thres.map) <- 0
values(glm.thres.map) <- 0
values(max.thres.map) <- 0
rf.thres.map[rf.map > thres.rf] <- 1
glm.thres.map[glm.map > thres.glm] <- 1
max.thres.map[max.map > thres.max] <- 1
#plot
plot(rf.thres.map)
plot(glm.thres.map)
plot(max.thres.map)
#Ensemble mapping based on frequency
ensemble.freq <- rf.thres.map + glm.thres.map + max.thres.map
#plot
plot(ensemble.freq)
points(lp.pres.xy)
points(lp.val.pres)
Now that we carefully fitted the SDMs, inspected model and extrapolation behaviour, and assessed predictive performance, it is finally time to make predictions in space and time. Importance points to consider here are quantification of uncertainty due to input data, algorithms, model complexity and boundary conditions (e.g. climate scenarios)(Araújo et al. 2019; Thuiller et al. 2019). When transferring the model to a different geographic area or time period, it is also recommended to quantify uncertainty due to novel environments (Zurell, Elith, and Schroeder 2012).
par(mfrow = c(2,2))
plot(glm.map, main = "Prediction of Leopard Distribution by GLM")
plot(rf.map, main = "Prediction of Leopard Distribution by Random Forest")
plot(max.map, main = "Prediction of Leopard Distribution by MaxEnt")
plot(ensemble.auc, main = "Prediction of Leopard Distribution \n Average weighted model by both GLM, Random Forest and MaxEnt")
{plot(ensemble.auc, main = "Presence detection of Leopard in 2014-2020")
points(lp.pres.xy, pch=16)}
Threashold Maps aka binary map are also useful for the visual interpretation.
require(sf)
## Loading required package: sf
## Warning: package 'sf' was built under R version 4.4.2
## Linking to GEOS 3.12.2, GDAL 3.9.3, PROJ 9.4.1; sf_use_s2() is TRUE
ensemble.thres <- rf.thres.map + max.thres.map
hmt_bd <- st_read("C:/git/leopard_distribution/data/Boundary/Htamanthi_BD.shp")
## Reading layer `Htamanthi_BD' from data source
## `C:\git\leopard_distribution\data\Boundary\Htamanthi_BD.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 13 fields
## Geometry type: POLYGON
## Dimension: XYZ
## Bounding box: xmin: 95.27255 ymin: 25.06531 xmax: 95.93139 ymax: 25.73413
## z_range: zmin: 0 zmax: 0
## Geodetic CRS: WGS 84
#hmt_shp <- readOGR("C:/git/leopard_distribution/data/Boundary/Htamanthi_BD.shp")
par(mfrow = c(2,2))
plot(rf.thres.map, main = "Thresholds of Leopard Distribution by Random Forest")
plot(max.thres.map, main = "Thresholds of Leopard Distribution by MaxEnt")
plot(ensemble.thres, main = "Thresholds of Leopard Distribution \n Average weighted model by both GLM, Random Forest and MaxEnt")
{plot(ensemble.thres, main = "Presence detection of Leopard in 2018-2019")
points(lp.pres.xy, pch=16)}
Let’s make them pretty!!!
ensemble_spdf <- rasterToPoints((ensemble.thres$layer))
ensemble_df <- data.frame(ensemble_spdf)
head(ensemble_df)
## x y layer
## 1 95.24851 25.7564 0
## 2 95.24880 25.7564 0
## 3 95.24910 25.7564 0
## 4 95.24940 25.7564 0
## 5 95.24970 25.7564 0
## 6 95.25000 25.7564 0
head(lp.pres.xy)
## longitude latitude
## 1 95.382 25.264
## 2 95.352 25.239
## 3 95.365 25.222
## 4 95.443 25.327
## 5 95.482 25.296
## 6 95.500 25.290
point_sf = st_as_sf(lp.pres, coords = c("longitude", "latitude"))
st_crs(point_sf) = crs(layers)
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.1
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
p4 <- ggplot() +
geom_raster(data = ensemble_df, mapping = aes(x = x, y = y, fill = layer)) +
scale_fill_gradient2(low = "transparent", mid = "darkblue", high = "darkgreen", midpoint = 1, name = "No. of models \npredicted") +
geom_sf(data = hmt_bd, colour = "black", fill = NA)+
geom_sf(data = point_sf, color = "red")+
xlab("Longitude") +
ylab("Latitude")+
ggtitle("Leopard Distribution Prediction Thresholds \nby GLM, Random Forest and MaxEnt Models") +
coord_sf()
Prediction Model AUC Esamble
auc_ensemble_spdf <- rasterToPoints((ensemble.auc$layer))
auc_ensemble_df <- data.frame(auc_ensemble_spdf)
head(auc_ensemble_df)
## x y layer
## 1 95.69640 25.74312 0.10727285
## 2 95.69670 25.74312 0.08162694
## 3 95.69700 25.74312 0.09581241
## 4 95.69729 25.74312 0.08859251
## 5 95.69759 25.74312 0.07789990
## 6 95.69789 25.74312 0.08778781
p3<- ggplot() +
geom_raster(data = auc_ensemble_df, mapping = aes(x = x, y = y, fill = layer)) +
scale_fill_viridis_c(name="Probability of \nLeopard \npresence")+
geom_sf(data = hmt_bd, colour = "black", fill = NA)+
geom_sf(data = point_sf, color = "red")+
xlab("Longitude") +
ylab("Latitude")+
ggtitle("Leopard Distribution Prediction AUC weighted \nby GLM, Random Forest and MaxEnt Models") +
coord_sf()
Prediction Random Forest
rf_spdf <- rasterToPoints((rf.map))
rf_df <- data.frame(rf_spdf)
head(rf_df)
## x y layer
## 1 95.69640 25.74312 0.144
## 2 95.69670 25.74312 0.090
## 3 95.69700 25.74312 0.158
## 4 95.69729 25.74312 0.128
## 5 95.69759 25.74312 0.084
## 6 95.69789 25.74312 0.102
p1 <- ggplot() +
geom_raster(data = rf_df, mapping = aes(x = x, y = y, fill = layer)) +
scale_fill_viridis_c(name="Probability of \nLeopard \npresence")+
geom_sf(data = hmt_bd, colour = "black", fill = NA)+
geom_sf(data = point_sf, color = "red")+
xlab("Longitude") +
ylab("Latitude")+
ggtitle("Leopard Distribution Prediction by Random Forest") +
coord_sf()
Prediction MaxEnt
max_spdf <- rasterToPoints((max.map))
max_df <- data.frame(max_spdf)
head(max_df)
## x y layer
## 1 95.69640 25.74312 0.1611708
## 2 95.69670 25.74312 0.1434253
## 3 95.69700 25.74312 0.1168999
## 4 95.69729 25.74312 0.1260262
## 5 95.69759 25.74312 0.1395614
## 6 95.69789 25.74312 0.1507249
p2 <- ggplot() +
geom_raster(data = max_df, mapping = aes(x = x, y = y, fill = layer)) +
scale_fill_viridis_c(name="Probability of \nLeopard \npresence")+
geom_sf(data = hmt_bd, colour = "black", fill = NA)+
geom_sf(data = point_sf, color = "red")+
xlab("Longitude") +
ylab("Latitude")+
ggtitle("Leopard Distribution Prediction by MaxEnt") +
coord_sf()
p2
Prediction GLM
glm_spdf <- rasterToPoints((glm.map))
glm_df <- data.frame(glm_spdf)
head(glm_df)
## x y layer
## 1 95.69640 25.74312 0.008702545
## 2 95.69670 25.74312 0.005954660
## 3 95.69700 25.74312 0.004313628
## 4 95.69729 25.74312 0.004746292
## 5 95.69759 25.74312 0.004890765
## 6 95.69789 25.74312 0.004428246
p6 <- ggplot() +
geom_raster(data = glm_df, mapping = aes(x = x, y = y, fill = layer)) +
scale_fill_viridis_c(name="Probability of \nLeopard \npresence")+
geom_sf(data = hmt_bd, colour = "black", fill = NA)+
geom_sf(data = point_sf, color = "red")+
xlab("Longitude") +
ylab("Latitude")+
ggtitle("Leopard Distribution Prediction by GLM") +
coord_sf()
p6
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.2
##
## Attaching package: 'patchwork'
## The following object is masked from 'package:raster':
##
## area
p6 + p1 + p2+p3+p4 + plot_layout(ncol = 2)
prediction.glm <- extract(glm.map, lp.val.xy)
glm.df <- data.frame(predict = prediction.glm, val.cov)
p.glm.elev <- ggplot(glm.df, aes(elev, predict)) +
geom_point(alpha = 0.3, colour = "blue")+
geom_smooth(span = 50)+
ggtitle("a)") +
ylim(-1, 2)+
xlab("Elevation (m)")+
ylab("Probability of Species Presence")
p.glm.slope <- ggplot(glm.df, aes(Slope, predict)) +
geom_point(alpha = 0.3, colour = "blue")+
geom_smooth(span = 50)+
ggtitle("a)") +
ylim(-1, 2)+
xlab("Slope (Degree)")+
ylab("Probability of Species Presence")
p.glm.ndvi <- ggplot(glm.df, aes(ndvi, predict)) +
geom_point(alpha = 0.3, colour = "blue")+
geom_smooth(span = 50)+
ggtitle("a)") +
ylim(-1, 2)+
xlab("Elevation (m)")+
ylab("Probability of Species Presence")
p.glm.canopy <- ggplot(glm.df, aes(canopy_h, predict)) +
geom_point(alpha = 0.3, colour = "blue")+
geom_smooth(span = 50)+
ggtitle("a)") +
ylim(-1, 2)+
xlab("canopy height (m)")+
ylab("Probability of Species Presence")
p.glm.d_road <- ggplot(glm.df, aes(dist_rd, predict)) +
geom_point(alpha = 0.3, colour = "blue")+
geom_smooth(span = 50)+
ggtitle("a)") +
ylim(-1, 2)+
xlab("distance to road (m)")+
ylab("Probability of Species Presence")
Thank you very much!